1. An apparatus for peer recovery, the apparatus comprising: 



a detection module configured to detect a failure of a first 
computer; 

a recovery coordination module configured to accept and reject 
requests from a recovery module to register as the counterpart of the first 
computer, and unregister the recovery module as the counterpart of the 
first computer upon request; and 

a recovery module configured to register with the recovery 
coordination module as the counterpart of the first computer, perform a 
recovery operation of the first computer, and unregister with the recovery 
coordination module as the counterpart of the first computer responsive to 
the detection module detecting the failure of the first computer. 

2. The apparatus of claim 1, wherein the recovery module initiates peer recovery 
automatically. 

3. The apparatus of claim 1, wherein the recovery module initiates peer recovery 
responsive to an operator command. 

4. The apparatus of claim 1, the recovery module comprising: 

an initialization module configured to initialize and start the 
counterpart of the first computer; and 

a backout module configured to retrieve private log data of the first 
computer, back out an in-flight transaction update, and release a data 
resource locked by the first computer. 

5. The apparatus of claim 1, wherein the detection module, the recovery 
coordination module, and the recovery module reside within a second computer. 

6. The apparatus of claim 1, wherein upon the receipt of a request for registering 
as the counterpart of the first computer, the recovery coordination module changes the status 
of the recovery module of the first computer from inactive to active. 

7. The apparatus of claim 1, wherein the recovery coordination module rejects a 
request for registering as the counterpart of the first computer once the status of the recovery 
module of the first computer is made active. 

8. The apparatus of claim 1, wherein upon the receipt of a request for 
unregistering as the counterpart of the first computer, the recovery coordination module 
changes the status of the recovery module of the first computer from active to inactive. 
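
By way of illustration only, and not as part of the claims, the accept, reject, and unregister behavior recited in claims 6 through 8 can be sketched as follows. The class name `RecoveryCoordinator`, its method names, the in-process lock, and the rule that only the registered counterpart may unregister are assumptions made for the example; the claims do not specify them.

```python
import threading


class RecoveryCoordinator:
    """Hypothetical sketch: tracks which counterpart is registered per failed computer."""

    def __init__(self):
        # Maps a failed computer's id to the id of its registered counterpart.
        self._counterparts = {}
        # Assumption: a single-process lock stands in for cluster-wide serialization.
        self._lock = threading.Lock()

    def status(self, failed_id):
        """Return "active" while a counterpart is registered, else "inactive"."""
        with self._lock:
            return "active" if failed_id in self._counterparts else "inactive"

    def register(self, failed_id, requester_id):
        """Accept the first request (inactive -> active); reject while active."""
        with self._lock:
            if failed_id in self._counterparts:
                return False
            self._counterparts[failed_id] = requester_id
            return True

    def unregister(self, failed_id, requester_id):
        """Return the status to inactive (active -> inactive) on request."""
        with self._lock:
            # Assumption: only the registered counterpart may unregister.
            if self._counterparts.get(failed_id) != requester_id:
                return False
            del self._counterparts[failed_id]
            return True
```

In this sketch a second requester is rejected until the first counterpart unregisters, which is one way to keep two computers from recovering the same failed computer at once.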

9. The apparatus of claim 1, the detection module further comprising a log list 
module configured to receive a status signal from at least one computer, wherein the detection 
module identifies the failed computer when the log list module does not receive the status 
signal from the failed computer within a pre-specified time interval. 
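
By way of illustration only, the timeout-based detection of claim 9 can be sketched as follows. The class name `LogList`, its methods, and the use of caller-supplied timestamps are assumptions made for the example, not details taken from the claims.

```python
class LogList:
    """Hypothetical sketch: records the last status signal seen from each computer."""

    def __init__(self, interval):
        # The pre-specified time interval, in the same units as the timestamps.
        self.interval = interval
        self._last_seen = {}

    def record_signal(self, computer_id, now):
        """Note that a status signal arrived from computer_id at time now."""
        self._last_seen[computer_id] = now

    def failed_computers(self, now):
        """Return ids whose last signal is older than the interval, sorted."""
        return sorted(
            cid
            for cid, seen in self._last_seen.items()
            if now - seen > self.interval
        )
```

Passing the current time in explicitly keeps the sketch deterministic; a real detection module would presumably read a clock and poll on a schedule.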



IBM Docket No.: SJ0920030069 






Kunzler & Associates Docket No.: 1200.2.100 



10. A system for cluster-wide peer recovery, the system comprising: 
a first computer; 

a second computer in communication with the first computer 
configured to detect a failure of the first computer, wherein the second 
computer registers as the counterpart of the failed first computer, recovers 
the operation of the failed first computer, and unregisters as the 
counterpart of the failed first computer; 

a shared memory controller in communication with the first 
computer and the second computer configured to store and retrieve 
computer component status and log data, the shared memory controller 
further configured to prevent unauthorized access to private log data and to 
lock data resources; and 

a disk configured to store and retrieve user data and system data in 
the disk's storage media for the cluster. 



11. The system of claim 10, the second computer further configured to initiate 
peer recovery automatically. 

12. The system of claim 10, the second computer further configured to initiate 
peer recovery responsive to an operator command. 

13. The system of claim 10, wherein the shared memory controller comprises a 
dedicated processor and a memory module. 

14. The system of claim 13, wherein the memory module is nonvolatile memory. 

15. The system of claim 10, the second computer further configured to recover the 
operation of the first computer by initializing and starting the counterpart of the first 
computer, retrieving the private log data of the first computer, backing out an in-flight 
transaction update of the first computer, and releasing a data resource locked by the first 
computer. 

16. The system of claim 10, the second computer further configured to block a 
third computer and the first computer from registering as the counterpart of the first 
computer. 

17. The system of claim 10, wherein the first computer and the second computer 
communicate point-to-point, using a channel-to-channel communication connection 
comprising an inbound signaling path and an outbound signaling path. 

18. The system of claim 10, wherein the computers use a symmetric 
multiprocessor configuration. 

19. The system of claim 10, wherein the computers use an asymmetric 
multiprocessor configuration. 

20. A computer readable storage medium comprising computer readable code 
configured to carry out a method for peer recovery, the method comprising: 



detecting a failure of a first computer; 

registering a counterpart of the first computer; 

recovering the operation of the first computer by the counterpart; and 

unregistering the counterpart of the first computer. 



21. The computer readable storage medium of claim 20, the method further 
comprising computer readable code configured to initiate the peer recovery automatically. 

22. The computer readable storage medium of claim 20, the method further 
comprising computer readable code configured to initiate the peer recovery responsive to an 
operator command. 

23. The computer readable storage medium of claim 20, the method for recovering 
the operation of the first computer by the counterpart further comprising: 



initializing and starting the counterpart; 



retrieving private undo log data of the first computer; 




backing out an in-flight transaction update of the first computer; and 



releasing a data resource locked by the first computer. 



24. The computer readable storage medium of claim 20, the method further 
comprising blocking the recovery modules of a third computer and the first computer from 
registering as the counterpart of the first computer. 

25. A method for peer recovery, the method comprising: 
detecting a failure in a first computer; 
registering a counterpart of the first computer; 
recovering the operation of the first computer by the counterpart; and 

unregistering the counterpart of the first computer. 
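
By way of illustration only, the register, recover, and unregister sequence of claim 25 can be sketched as follows, assuming the failure has already been detected. The function name `peer_recover`, the dictionary used as a registry, and the choice to unregister even when recovery raises an error are assumptions made for the example.

```python
def peer_recover(failed_id, my_id, registry, recover):
    """Hypothetical sketch of the claimed sequence for one detected failure.

    registry: dict mapping a failed computer's id to its counterpart's id.
    recover:  callable that performs the recovery operation for failed_id.
    Returns True if this caller acted as the counterpart, False if rejected.
    """
    # Register as the counterpart; give up if another counterpart is active.
    if failed_id in registry:
        return False
    registry[failed_id] = my_id
    try:
        # Recover the operation of the failed computer.
        recover(failed_id)
    finally:
        # Assumption: unregister even if recovery fails, so a retry is possible.
        del registry[failed_id]
    return True
```

A plain dictionary is only safe within one thread; in a cluster the registry would presumably sit behind some shared, serialized store.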

26. The method of claim 25, the method further comprising blocking the recovery 
modules of a third computer and the first computer from registering as the counterpart of the 
first computer. 

27. The method of claim 25, the method of recovering the operations of the first 
computer by the counterpart further comprising: 



initializing and starting the counterpart of the first computer; 

retrieving private undo log data of the first computer; 

backing out an in-flight transaction update of the first computer; and 

releasing a data resource locked by the first computer. 
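
By way of illustration only, the backing out and lock release steps of claim 27 can be sketched as follows. The undo log layout (a list of key and prior-value pairs, all belonging to in-flight transactions), the use of `None` to mean that a key did not previously exist, and the function name `back_out_and_release` are assumptions made for the example. Initializing and starting the counterpart is not modeled.

```python
def back_out_and_release(undo_log, data, locks, failed_id):
    """Hypothetical sketch: undo in-flight updates, then free the failed computer's locks.

    undo_log: list of (key, old_value) pairs in the order the updates were made.
    data:     dict holding the current values, modified in place.
    locks:    dict mapping a locked key to the id of the computer holding it.
    """
    # Apply undo entries newest first so the oldest prior value wins.
    for key, old_value in reversed(undo_log):
        if old_value is None:
            # Assumption: None marks a key the in-flight transaction created.
            data.pop(key, None)
        else:
            data[key] = old_value

    # Release every data resource still locked by the failed computer.
    for key in [k for k, owner in locks.items() if owner == failed_id]:
        del locks[key]
```

Walking the log in reverse matters when one key was updated more than once: the last entry applied is then the value from before the transaction began.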



28. The method of claim 25, further comprising initiating peer recovery 
automatically. 

29. The method of claim 25, further comprising initiating peer recovery 
responsive to an operator command. 

30. An apparatus for peer recovery, the apparatus comprising: 




means for detecting a failure of a first computer; 

means for registering a first counterpart of the first computer; 

means for blocking a second counterpart from registering as the 
counterpart of the first computer; 

means for recovering the operation of the first computer by the first 
counterpart; and 

means for unregistering the first counterpart of the first computer. 